archived/identify_key_insights_from_textual_document/document_entity

{ "cells": [ { "cell_type": "markdown", "id": "c3be9d48", "metadata": {}, "source": [ "# Document Understanding Solution - Named Entity Recognition\n" ] }, { "cell_type": "markdown", "id": "74af46b4", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "ca69ffcc", "metadata": {}, "source": [ "\n", "Named entity recognition (NER) seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, and locations, time expressions, quantities, monetary values, and percentages. In this notebook, we demonstrate three use cases of named entity recognition:\n", "\n", "1. How to directly deploy a pre-trained Transformer-based named entity recognition model to perform inference.\n", "2. How to fine-tune a pre-trained Transformer model on a custom dataset, and then run inference on the fine-tuned model.\n", "3. How to run [SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) (a hyperparameter optimization procedure) to find the best model and compare it with the model fine-tuned in point 2. The performance of the optimal model and of the model fine-tuned in point 2 is evaluated on a hold-out test set. \n", "\n", "**Note**: When running this notebook on SageMaker Studio, you should make\n", "sure the `PyTorch 1.10 Python 3.8 CPU Optimized` image/kernel is used. 
When\n", "running this notebook on SageMaker Notebook Instance, you should make\n", "sure the 'sagemaker-soln' kernel is used." ] }, { "cell_type": "markdown", "id": "d30de33b", "metadata": {}, "source": [ "## 1. Set Up\n", "\n", "Before executing the notebook, there are some initial steps required for setup. This notebook requires latest version of sagemaker and ipywidgets." ] }, { "cell_type": "code", "execution_count": null, "id": "9f956b38", "metadata": {}, "outputs": [], "source": [ "!pip install -U sagemaker ipywidgets datasets seqeval" ] }, { "cell_type": "markdown", "id": "90b3b484", "metadata": {}, "source": [ "We start by importing a variety of packages that are used throughout\n", "the notebook. One of the most important packages is the Amazon SageMaker\n", "Python SDK (i.e. `import sagemaker`). We also import modules from our own\n", "custom (and editable) package that can be found at `../package`." ] }, { "cell_type": "code", "execution_count": null, "id": "766e7e1e", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import boto3\n", "import sagemaker\n", "from sagemaker.pytorch import PyTorchModel\n", "import sys\n", "from sagemaker.huggingface import HuggingFace\n", "from sagemaker.huggingface import HuggingFaceModel\n", "\n", "import config\n", "\n", "IAM_ROLE = sagemaker.get_execution_role()\n", "aws_region = boto3.Session().region_name\n", "sess = sagemaker.Session()\n", "DEFAULT_BUCKET = sess.default_bucket()" ] }, { "cell_type": "markdown", "id": "67b12c27", "metadata": {}, "source": [ "Up next, we define the current folder and create a SageMaker client (from\n", "`boto3`). We can use the SageMaker client to call SageMaker APIs\n", "directly, as an alternative to using the Amazon SageMaker SDK. We use\n", "it at the end of the notebook to delete certain resources that are\n", "created in this notebook." 
] }, { "cell_type": "code", "execution_count": null, "id": "bd6736f4", "metadata": {}, "outputs": [], "source": [ "current_folder = config.get_current_folder(globals())\n", "sagemaker_client = boto3.client(\"sagemaker\")" ] }, { "cell_type": "markdown", "id": "dbda7a4e", "metadata": {}, "source": [ "## 2. Run inference on the pre-trained named entity recognition model" ] }, { "cell_type": "markdown", "id": "33195063", "metadata": {}, "source": [ "We use the unique solution prefix to name the model and endpoint. Up next, we need to define the Amazon SageMaker Model which references\n", "the source code and specifies which container to use. \n", "\n", "This is a named entity recognition model, [en_core_web_md](https://spacy.io/models/en#en_core_web_md), from the [spaCy](https://spacy.io) library. It takes a text string as input and predicts named entities in the input text. \n", "\n", "The pre-trained model from the spaCy library doesn't rely on a specific deep learning framework. Just for consistency with the other notebooks, we continue to use the `PyTorchModel` from the Amazon SageMaker Python SDK. Using `PyTorchModel` and setting the `framework_version` argument means that our deployed model runs inside a container that has PyTorch pre-installed. Other requirements can be installed by defining a `requirements.txt` file at the specified `source_dir` location. We use the `entry_point` argument to reference the code (within `source_dir`) that should be run for model inference: functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` are expected to be defined. Lastly, you can pass `model_data` from a training job, but we are going to load the pre-trained model in the source code running on the endpoint. We still need to provide `model_data`, so we pass an empty archive." 
] }, { "cell_type": "code", "execution_count": null, "id": "f6e75f1d", "metadata": {}, "outputs": [], "source": [ "import uuid\n", "\n", "unique_hash = str(uuid.uuid4())[:6]\n", "endpoint_name = f\"{config.SOLUTION_PREFIX}-{unique_hash}-entity-recognition-endpoint\"" ] }, { "cell_type": "markdown", "id": "36c7bc6a", "metadata": {}, "source": [ "### 2.1. Deploy an endpoint" ] }, { "cell_type": "code", "execution_count": null, "id": "6634fa7b", "metadata": {}, "outputs": [], "source": [ "model = PyTorchModel(\n", " model_data=f\"{config.SOURCE_S3_PATH}/artifacts/models/empty.tar.gz\",\n", " entry_point=\"entry_point.py\",\n", " source_dir=\"containers/entity_recognition\",\n", " role=IAM_ROLE,\n", " framework_version=\"1.5.0\",\n", " py_version=\"py3\",\n", " code_location=\"s3://\" + DEFAULT_BUCKET + \"/code\",\n", " env={\"MMS_DEFAULT_RESPONSE_TIMEOUT\": \"3000\"},\n", ")" ] }, { "cell_type": "markdown", "id": "f71ae4f3", "metadata": {}, "source": [ "Using this Amazon SageMaker Model, we can deploy a HTTPS endpoint on a\n", "dedicated instance. We choose to deploy the endpoint on a single\n", "ml.p3.2xlarge instance (or ml.g4dn.2xlarge if unavailable in this\n", "region). You can expect this deployment step to take\n", "around 5 minutes. After approximately 15 dashes, you can expect to see an\n", "exclamation mark which indicates a successful deployment." 
] }, { "cell_type": "code", "execution_count": null, "id": "2d5856db", "metadata": {}, "outputs": [], "source": [ "import time\n", "from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "predictor = model.deploy(\n", "    endpoint_name=endpoint_name,\n", "    instance_type=config.HOSTING_INSTANCE_TYPE,\n", "    initial_instance_count=1,\n", "    serializer=JSONSerializer(),\n", "    deserializer=JSONDeserializer(),\n", ")\n", "\n", "time.sleep(10)" ] }, { "cell_type": "markdown", "id": "6c10bf39", "metadata": {}, "source": [ "When you're trying to update the model for development purposes, but\n", "experiencing issues because the model/endpoint-config/endpoint already\n", "exists, you can delete the existing model/endpoint-config/endpoint by\n", "uncommenting and running the following commands:" ] }, { "cell_type": "code", "execution_count": null, "id": "f91e86ed", "metadata": {}, "outputs": [], "source": [ "# sagemaker_client.delete_endpoint(EndpointName=endpoint_name)\n", "# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)" ] }, { "cell_type": "markdown", "id": "9741244a", "metadata": {}, "source": [ "When calling our new endpoint from the notebook, we use an Amazon\n", "SageMaker SDK\n", "[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).\n", "A `Predictor` is used to send data to an endpoint (as part of a request),\n", "and interpret the response. Our `model.deploy` command returned a\n", "`Predictor` but, by default, it sends and receives numpy arrays. Our\n", "endpoint expects to receive (and also sends) JSON formatted objects, so\n", "we modify the `Predictor` to use JSON instead of the PyTorch endpoint\n", "default of numpy arrays. JSON is used here because it is a standard\n", "endpoint format and the endpoint response can contain nested data\n", "structures." ] }, { "cell_type": "markdown", "id": "c0900206", "metadata": {}, "source": [ "### 2.2. 
Example input sentences for inference & Query endpoint" ] }, { "cell_type": "markdown", "id": "d9194c06", "metadata": {}, "source": [ "With our model successfully deployed and our predictor configured, we can\n", "try out the entity recognizer on example inputs. All we need to do is\n", "construct a dictionary object with a single key called `text` and provide\n", "the input string. We call `predict` on our predictor and we should\n", "get a response from the endpoint that contains our entities." ] }, { "cell_type": "code", "execution_count": null, "id": "4783418b", "metadata": {}, "outputs": [], "source": [ "data = {\n", "    \"text\": \"Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.\"\n", "}\n", "response = predictor.predict(data=data)" ] }, { "cell_type": "markdown", "id": "9170d085", "metadata": {}, "source": [ "We have the response and we can print out the named entities and noun\n", "chunks that have been extracted from the text above. You can see the\n", "verbatim text of each alongside its location in the original text (given\n", "by start and end character indexes). Usually a document contains many\n", "more noun chunks than named entities, but named entities have an\n", "additional field called `label` that indicates the class of the named\n", "entity. Since the spaCy model was trained on the OntoNotes 5 corpus, it\n", "uses the following classes:\n", "\n", "| TYPE | DESCRIPTION |\n", "|---|---|\n", "| PERSON | People, including fictional. |\n", "| NORP | Nationalities or religious or political groups. |\n", "| FAC | Buildings, airports, highways, bridges, etc. |\n", "| ORG | Companies, agencies, institutions, etc. |\n", "| GPE | Countries, cities, states. |\n", "| LOC | Non-GPE locations, mountain ranges, bodies of water. |\n", "| PRODUCT | Objects, vehicles, foods, etc. (Not services.) 
|\n", "| EVENT | Named hurricanes, battles, wars, sports events, etc. |\n", "| WORK_OF_ART | Titles of books, songs, etc. |\n", "| LAW | Named documents made into laws. |\n", "| LANGUAGE | Any named language. |\n", "| DATE | Absolute or relative dates or periods. |\n", "| TIME | Times smaller than a day. |\n", "| PERCENT | Percentage, including “%”. |\n", "| MONEY | Monetary values, including unit. |\n", "| QUANTITY | Measurements, as of weight or distance. |\n", "| ORDINAL | “first”, “second”, etc. |\n", "| CARDINAL | Numerals that do not fall under another type. |" ] }, { "cell_type": "code", "execution_count": null, "id": "a7653183", "metadata": {}, "outputs": [], "source": [ "print(response[\"entities\"])\n", "print(response[\"noun_chunks\"])" ] }, { "cell_type": "markdown", "id": "d464d2a7", "metadata": {}, "source": [ "You can try more examples above, but note that this model has been\n", "pre-trained on the OntoNotes 5 dataset. You may need to fine-tune this\n", "model with your own named entity recognition data to obtain better results." ] }, { "cell_type": "markdown", "id": "ae8c3026", "metadata": {}, "source": [ "### 2.3. Clean up the endpoint\n", "\n", "When you've finished with the entity recognition endpoint (and associated\n", "endpoint-config), make sure that you delete it to avoid accidental\n", "charges." ] }, { "cell_type": "code", "execution_count": null, "id": "93e401ac", "metadata": {}, "outputs": [], "source": [ "# Delete the SageMaker endpoint and the attached resources\n", "predictor.delete_model()\n", "predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "id": "e1532026", "metadata": {}, "source": [ "## 3. Fine-tune the pre-trained model on a custom dataset" ] }, { "cell_type": "markdown", "id": "0d40f354", "metadata": {}, "source": [ "Previously, we saw how to run inference on a pre-trained named entity recognition model. Next, we discuss how a model can be fine-tuned on a custom dataset. 
\n", "\n", "The model for fine-tuning attaches a token classification layer to each token embedding output by the Text Embedding model\n", "and initializes the layer parameters to random values. The fine-tuning step fine-tunes \n", "all the model parameters to minimize prediction error on the input data and returns the fine-tuned model. The Text Embedding model we use in this demonstration is [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) from Hugging Face. The dataset we fine-tune the model on is the English subset of [WikiANN](https://github.com/afshinrahimi/mmner) (also known as PAN-X). WikiANN is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with `LOC` (location), `PER` (person), and `ORG` (organisation) tags in the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).\n", "\n", "\n", "The model returned by fine-tuning can be further deployed for inference. Below are the instructions \n", "for how the training data should be formatted for input to the model. \n", "\n", "- **Input:** A directory containing a `txt` format file.\n", "  - The first column of the `txt` format file should have the tokens parsed from each sentence.\n", "  - The second column should have the corresponding named entity tag.\n", "- **Output:** A trained model that can be deployed for inference. \n", " \n", "Below is an example of a `txt` format file showing the first few rows of its two columns. Note that the file should not have any header. For the `B`, `I`, and `O` prefixes of the tags, please check the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for details. The data for training and validation are downloaded into the directory `../data/wikiann` in the following section.\n", "\n", "| | | \n", "|--- |---|\n", "| R.H. 
| B-ORG |\n", "|Saunders| I-ORG |\n", "|(| O |\n", "|St.| B-ORG |\n", "|Lawrence| I-ORG |\n", "|River| I-ORG |\n", "|)| O |\n", "|(| O |\n", "|... | ... |\n", " \n", "\n", "The WikiANN dataset is downloaded from [Dataset Homepage](https://github.com/afshinrahimi/mmner). [Apache 2.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode).\n", "\n", "Citation:\n", "@inproceedings{rahimi-etal-2019-massively,\n", " title = \"Massively Multilingual Transfer for {NER}\",\n", " author = \"Rahimi, Afshin and\n", " Li, Yuan and\n", " Cohn, Trevor\",\n", " booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n", " month = jul,\n", " year = \"2019\",\n", " address = \"Florence, Italy\",\n", " publisher = \"Association for Computational Linguistics\",\n", " url = \"https://www.aclweb.org/anthology/P19-1015\",\n", " pages = \"151--164\",\n", "}\n" ] }, { "cell_type": "markdown", "id": "9bce6081", "metadata": {}, "source": [ "### 3.1. Download, preprocess, and upload the training data" ] }, { "cell_type": "code", "execution_count": null, "id": "38f1caef", "metadata": {}, "outputs": [], "source": [ "!aws s3 cp --recursive $config.SOURCE_S3_PATH/artifacts/data/wikiann/ data/wikiann" ] }, { "cell_type": "markdown", "id": "927d4fd4", "metadata": {}, "source": [ "The dataset has already been partitioned into `train.txt`, `dev.txt`, and `test.txt`, so we don't need to split the training data as we did in previous notebooks. The `train.txt` and `dev.txt` files are used as training and validation data. The `test.txt` file is used as hold-out test data to evaluate model performance with and without hyperparameter optimization. Next, we upload the training and validation files to S3 paths that are used as input for training." 
] }, { "cell_type": "code", "execution_count": null, "id": "a4de77ef", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "prefix = \"NER\"\n", "\n", "boto3.Session().resource(\"s3\").Bucket(DEFAULT_BUCKET).Object(\n", " os.path.join(prefix, \"train/data.txt\")\n", ").upload_file(\"data/wikiann/train/train.txt\")\n", "\n", "boto3.Session().resource(\"s3\").Bucket(DEFAULT_BUCKET).Object(\n", " os.path.join(prefix, \"validation/data.txt\")\n", ").upload_file(\"data/wikiann/validation/dev.txt\")" ] }, { "cell_type": "markdown", "id": "03ff0d30", "metadata": {}, "source": [ "### 3.2. Set Training parameters\n", "\n", "Now that we are done with all the setup that is needed, we are ready to fine-tune our name entity recognition model." ] }, { "cell_type": "code", "execution_count": null, "id": "9459823b", "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"pretrained-model\": \"distilbert-base-uncased\",\n", " \"learning-rate\": 2e-6,\n", " \"num-train-epochs\": 2,\n", " \"batch-size\": 16,\n", " \"weight-decay\": 1e-5,\n", " \"early-stopping-patience\": 2,\n", "}" ] }, { "cell_type": "markdown", "id": "7e8deb21", "metadata": {}, "source": [ "### 3.3. Fine-tuning without hyperparameter optimization\n", "\n", "We use the HuggingFace from the Amazon SageMaker Python SDK. 
The entry script is located under `../containers/entity_recognition/finetuning/training.py`." ] }, { "cell_type": "code", "execution_count": null, "id": "4b15aedb", "metadata": {}, "outputs": [], "source": [ "training_job_name = f\"{config.SOLUTION_PREFIX}-ner-finetune\"\n", "\n", "training_instance_type = config.TRAINING_INSTANCE_TYPE\n", "\n", "ner_estimator = HuggingFace(\n", "    pytorch_version=\"1.10.2\",\n", "    py_version=\"py38\",\n", "    transformers_version=\"4.17.0\",\n", "    entry_point=\"training.py\",\n", "    source_dir=\"containers/entity_recognition/finetuning\",\n", "    hyperparameters=hyperparameters,\n", "    role=IAM_ROLE,\n", "    instance_count=1,\n", "    instance_type=training_instance_type,\n", "    output_path=f\"s3://{DEFAULT_BUCKET}/{prefix}/output\",\n", "    code_location=f\"s3://{DEFAULT_BUCKET}/{prefix}/output\",\n", "    tags=[{\"Key\": config.TAG_KEY, \"Value\": config.SOLUTION_PREFIX}],\n", "    sagemaker_session=sess,\n", "    volume_size=30,\n", "    env={\"MMS_DEFAULT_RESPONSE_TIMEOUT\": \"500\"},\n", "    base_job_name=training_job_name,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "a4f74b46", "metadata": {}, "outputs": [], "source": [ "ner_estimator.fit(\n", "    {\n", "        \"train\": f\"s3://{DEFAULT_BUCKET}/{prefix}/train/\",\n", "        \"validation\": f\"s3://{DEFAULT_BUCKET}/{prefix}/validation/\",\n", "    }\n", ")" ] }, { "cell_type": "markdown", "id": "b242d5db", "metadata": {}, "source": [ "### 3.4. Deploy & run inference on the fine-tuned model\n", "\n", "A trained model does nothing on its own. We now want to use the model to perform inference. In this example, that means predicting the entity tag of each token in an input text. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "db444951", "metadata": {}, "outputs": [], "source": [ "inference_instance_type = config.HOSTING_INSTANCE_TYPE\n", "endpoint_name_finetune = (\n", " f\"{config.SOLUTION_PREFIX}ner-finetune-{datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')}\"\n", ")\n", "\n", "finetuned_predictor = HuggingFaceModel(\n", " model_data=ner_estimator.model_data,\n", " source_dir=\"containers/entity_recognition/finetuning\",\n", " entry_point=\"inference.py\",\n", " role=IAM_ROLE,\n", " py_version=\"py38\",\n", " pytorch_version=\"1.10.2\",\n", " transformers_version=\"4.17.0\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "ced89c5c", "metadata": {}, "outputs": [], "source": [ "finetuned_predictor.deploy(\n", " initial_instance_count=1,\n", " instance_type=inference_instance_type,\n", " endpoint_name=endpoint_name_finetune,\n", ")\n", "\n", "time.sleep(10)" ] }, { "cell_type": "markdown", "id": "78973c8b", "metadata": {}, "source": [ "Before using the test examples to query the deployed endpoint, we firstly prepare the `test.txt` into the right format. We create a list of words and a list of these words entity labels for each sentence. We store this in a `pandas.DataFrame` by reading the `test.txt` and reading each sentence as a row." 
] }, { "cell_type": "code", "execution_count": null, "id": "c48f9575", "metadata": {}, "outputs": [], "source": [ "import itertools\n", "import pandas as pd\n", "\n", "\n", "def get_tokens_and_ner_tags(filename):\n", "    with open(filename, \"r\", encoding=\"utf8\") as f:\n", "        lines = f.readlines()\n", "        split_list = [list(y) for x, y in itertools.groupby(lines, lambda z: z == \"\\n\") if not x]\n", "        tokens = [[x.split(\"\\t\")[0].split(\"en:\")[1] for x in y] for y in split_list]\n", "        entities = [[x.split(\"\\t\")[1][:-1] for x in y] for y in split_list]\n", "    return pd.DataFrame({\"tokens\": tokens, \"ner_tags\": entities})" ] }, { "cell_type": "code", "execution_count": null, "id": "a3b6cf07", "metadata": {}, "outputs": [], "source": [ "test_data = get_tokens_and_ner_tags(\"data/wikiann/test/test.txt\")" ] }, { "cell_type": "code", "execution_count": null, "id": "673f9116", "metadata": {}, "outputs": [], "source": [ "test_data" ] }, { "cell_type": "code", "execution_count": null, "id": "6a1f2e4f", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "content_type = \"application/list-text\"\n", "\n", "\n", "def query_endpoint(payload, endpoint_name):\n", "    client = boto3.client(\"runtime.sagemaker\")\n", "    response = client.invoke_endpoint(\n", "        EndpointName=endpoint_name,\n", "        ContentType=content_type,\n", "        Body=json.dumps(payload).encode(\"utf-8\"),\n", "    )\n", "    return response\n", "\n", "\n", "def parse_response(query_response):\n", "    model_predictions = json.loads(query_response[\"Body\"].read())\n", "    predicted_label = model_predictions[\"predict_label\"]\n", "    token = model_predictions[\"token\"]\n", "    word_id = model_predictions[\"word_id\"]\n", "    return predicted_label, token, word_id" ] }, { "cell_type": "markdown", "id": "1623954d", "metadata": {}, "source": [ "Now we query the endpoint. Each text string (corresponding to each row in `test.txt`) is tokenized into one or more tokens that can be passed to the Transformer. 
When one text string is tokenized into multiple tokens (for example, the text string `R.H.` is tokenized as `R`, `.`, `H`, `.`), each of the four tokens gets a predicted named entity tag. In this case, we need to duplicate the ground truth named entity tag of the text string for all four tokens. As a result, the number of predicted and ground truth named entity tags is the same, and thus an evaluation score can be computed. The predicted result `word_id` is used to identify the tokens that belong to the same text string." ] }, { "cell_type": "code", "execution_count": null, "id": "dd61b6d1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import json\n", "\n", "batch_size = 10\n", "num_examples = test_data.shape[0]\n", "predicted_label, token, word_id = [], [], []\n", "\n", "for i in np.arange(0, num_examples, step=batch_size):\n", "    query_response_batch = query_endpoint(\n", "        test_data.iloc[i : (i + batch_size), :].tokens.values.tolist(),\n", "        endpoint_name_finetune,\n", "    )\n", "\n", "    predicted_label_batch, token_batch, word_id_batch = parse_response(query_response_batch)\n", "    predicted_label.extend(predicted_label_batch)\n", "    token.extend(token_batch)\n", "    word_id.extend(word_id_batch)" ] }, { "cell_type": "markdown", "id": "30c5ecc4", "metadata": {}, "source": [ "The returned predictions contain `predicted_label`, `token`, and `word_id`, each of which has the same number of rows (sentences) as `test_data`. Each element of `predicted_label` (or `token` or `word_id`) is itself a list, where each element corresponds to a token in the corresponding sentence. Let's first do a sanity check that the number of predictions equals the number of rows in `test_data`." 
] }, { "cell_type": "code", "execution_count": null, "id": "158b1ead", "metadata": {}, "outputs": [], "source": [ "assert len(predicted_label) == len(token) == len(word_id) == test_data.shape[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "bfbe0740", "metadata": {}, "outputs": [], "source": [ "def tokenize_and_align_labels(examples, word_ids_all):\n", "    label_all_tokens = True\n", "    labels = []\n", "    for i, label in enumerate(examples[\"ner_tags\"]):\n", "        word_ids = word_ids_all[i]\n", "        previous_word_idx = None\n", "        label_ids = []\n", "        for word_idx in word_ids:\n", "            if word_idx is None:\n", "                label_ids.append(-100)\n", "            elif label[word_idx] == \"0\":\n", "                label_ids.append(0)\n", "            elif word_idx != previous_word_idx:\n", "                label_ids.append(label[word_idx])\n", "            else:\n", "                label_ids.append(label[word_idx] if label_all_tokens else -100)\n", "            previous_word_idx = word_idx\n", "        labels.append(label_ids)\n", "    return labels" ] }, { "cell_type": "markdown", "id": "5996879a", "metadata": {}, "source": [ "Next, because each text string can be tokenized into one or multiple tokens, we need to duplicate the ground truth named entity tag of the text string for all the tokens associated with it. " ] }, { "cell_type": "code", "execution_count": null, "id": "567bbb75", "metadata": {}, "outputs": [], "source": [ "labels_gt = tokenize_and_align_labels(test_data, word_id)" ] }, { "cell_type": "markdown", "id": "500a14ae", "metadata": {}, "source": [ "Let's do another sanity check that, within each sentence, the number of tokens (word ids) equals the number of ground truth named entity tags we just created." 
] }, { "cell_type": "code", "execution_count": null, "id": "5544f40c", "metadata": {}, "outputs": [], "source": [ "for idx, i in enumerate(predicted_label):\n", "    assert len(i) == len(token[idx]) == len(word_id[idx]) == len(labels_gt[idx])" ] }, { "cell_type": "markdown", "id": "5793095c", "metadata": {}, "source": [ "Now we load the evaluation metric to compute evaluation scores." ] }, { "cell_type": "code", "execution_count": null, "id": "630c91c8", "metadata": {}, "outputs": [], "source": [ "from datasets import load_metric" ] }, { "cell_type": "code", "execution_count": null, "id": "fb8d220d", "metadata": {}, "outputs": [], "source": [ "metric = load_metric(\"seqeval\")" ] }, { "cell_type": "markdown", "id": "55695ac4", "metadata": {}, "source": [ "For details of the evaluation metrics, please check the [official documentation](https://huggingface.co/spaces/evaluate-metric/seqeval). For the overall precision, recall, F1, and accuracy, a larger value indicates better performance." ] }, { "cell_type": "code", "execution_count": null, "id": "85fccabd", "metadata": {}, "outputs": [], "source": [ "token_all, predict_all, groundtruth_all = [], [], []\n", "\n", "for idx, i in enumerate(predicted_label):\n", "    tmp_token, tmp_predict, tmp_gt = [], [], []\n", "    for idx2, each_token in enumerate(token[idx]):\n", "        if each_token in [\"[CLS]\", \"[SEP]\"]:  # exclude the CLS and SEP tokens\n", "            continue\n", "        assert len(i) == len(labels_gt[idx]) == len(token[idx])\n", "        tmp_token.append(each_token)\n", "        tmp_predict.append(i[idx2])\n", "        tmp_gt.append(labels_gt[idx][idx2])\n", "    assert len(tmp_token) == len(tmp_predict) == len(tmp_gt)\n", "    token_all.append(tmp_token)\n", "    predict_all.append(tmp_predict)\n", "    groundtruth_all.append(tmp_gt)\n", "\n", "# Check every sentence: a bare list in an assert is always truthy\n", "assert all(-100 not in x for x in groundtruth_all)" ] }, { "cell_type": "code", "execution_count": null, "id": "3dc8d912", "metadata": {}, "outputs": [], "source": [ "metrics = metric.compute(predictions=predict_all, 
references=groundtruth_all)\n", "result = {\n", " \"precision\": [metrics[\"overall_precision\"]],\n", " \"recall\": [metrics[\"overall_recall\"]],\n", " \"f1\": [metrics[\"overall_f1\"]],\n", " \"accuracy\": [metrics[\"overall_accuracy\"]],\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "78cceac7", "metadata": {}, "outputs": [], "source": [ "result = pd.DataFrame.from_dict(result, orient=\"index\", columns=[\"No HPO\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "f81aec7c", "metadata": {}, "outputs": [], "source": [ "result" ] }, { "cell_type": "markdown", "id": "fd287c6e", "metadata": {}, "source": [ "## 4. Finetune the pre-trained model on a custom dataset with automatic model tuning (AMT)\n", "\n", "Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs." 
] }, { "cell_type": "code", "execution_count": null, "id": "50fece72", "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner\n", "\n", "hyperparameters_range = {\n", " \"learning-rate\": ContinuousParameter(1e-5, 0.1, scaling_type=\"Logarithmic\"),\n", " \"weight-decay\": ContinuousParameter(1e-6, 1e-2, scaling_type=\"Logarithmic\"),\n", "}\n", "\n", "hyperparameters = {\n", " \"pretrained-model\": \"distilbert-base-uncased\",\n", " \"num-train-epochs\": 3,\n", " \"batch-size\": 16,\n", " \"token-column-name\": \"tokens\",\n", " \"tag-column-name\": \"ner_tags\",\n", " \"early-stopping-patience\": 3,\n", "}" ] }, { "cell_type": "markdown", "id": "3ccd507c", "metadata": {}, "source": [ "### 4.1. Fine-tuning with hyperparameter optimization" ] }, { "cell_type": "code", "execution_count": null, "id": "cadaf327", "metadata": {}, "outputs": [], "source": [ "tuning_job_name = f\"{config.SOLUTION_PREFIX}-ner-hpo\"\n", "\n", "\n", "estimator = HuggingFace(\n", " pytorch_version=\"1.10.2\",\n", " py_version=\"py38\",\n", " transformers_version=\"4.17.0\",\n", " entry_point=\"training.py\",\n", " source_dir=\"containers/entity_recognition/finetuning\",\n", " hyperparameters=hyperparameters,\n", " role=IAM_ROLE,\n", " instance_count=1,\n", " instance_type=training_instance_type,\n", " output_path=f\"s3://{DEFAULT_BUCKET}/{prefix}/output\",\n", " code_location=f\"s3://{DEFAULT_BUCKET}/{prefix}/output\",\n", " tags=[{\"Key\": config.TAG_KEY, \"Value\": config.SOLUTION_PREFIX}],\n", " sagemaker_session=sess,\n", " volume_size=30,\n", " env={\"MMS_DEFAULT_RESPONSE_TIMEOUT\": \"500\"},\n", ")\n", "\n", "tuner = HyperparameterTuner(\n", " estimator,\n", " \"f1\",\n", " hyperparameters_range,\n", " [{\"Name\": \"f1\", \"Regex\": \"'eval_f1': ([0-9\\\\.]+)\"}],\n", " max_jobs=6,\n", " max_parallel_jobs=3,\n", " objective_type=\"Maximize\",\n", " base_tuning_job_name=tuning_job_name,\n", ")\n", "\n", 
"tuner.fit(\n", " {\n", " \"train\": f\"s3://{DEFAULT_BUCKET}/{prefix}/train/\",\n", " \"validation\": f\"s3://{DEFAULT_BUCKET}/{prefix}/validation/\",\n", " },\n", " logs=True,\n", ")" ] }, { "cell_type": "markdown", "id": "979e1eb0", "metadata": {}, "source": [ "Fetch the exact tuning job name." ] }, { "cell_type": "code", "execution_count": null, "id": "883af180", "metadata": {}, "outputs": [], "source": [ "sm_client = boto3.Session().client(\"sagemaker\")\n", "\n", "tuning_job_name = tuner.latest_tuning_job.name\n", "tuning_job_name" ] }, { "cell_type": "code", "execution_count": null, "id": "c6e7117e", "metadata": {}, "outputs": [], "source": [ "tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name\n", ")\n", "\n", "status = tuning_job_result[\"HyperParameterTuningJobStatus\"]\n", "if status != \"Completed\":\n", " print(\"Reminder: the tuning job has not been completed.\")\n", "\n", "job_count = tuning_job_result[\"TrainingJobStatusCounters\"][\"Completed\"]\n", "print(\"%d training jobs have completed\" % job_count)\n", "\n", "is_maximize = (\n", " tuning_job_result[\"HyperParameterTuningJobConfig\"][\"HyperParameterTuningJobObjective\"][\"Type\"]\n", " != \"Maximize\"\n", ")\n", "objective_name = tuning_job_result[\"HyperParameterTuningJobConfig\"][\n", " \"HyperParameterTuningJobObjective\"\n", "][\"MetricName\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "030a083d", "metadata": {}, "outputs": [], "source": [ "tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)\n", "\n", "full_df = tuner_analytics.dataframe()\n", "\n", "if len(full_df) > 0:\n", " df = full_df[full_df[\"FinalObjectiveValue\"] > -float(\"inf\")]\n", " if len(df) > 0:\n", " df = df.sort_values(\"FinalObjectiveValue\", ascending=False)\n", " print(\"Number of training jobs with valid objective: %d\" % len(df))\n", " print(\n", " {\n", " \"lowest\": min(df[\"FinalObjectiveValue\"]),\n", " 
\"highest\": max(df[\"FinalObjectiveValue\"]),\n", " }\n", " )\n", " pd.set_option(\"display.max_colwidth\", -1) # Don't truncate TrainingJobName\n", " else:\n", " print(\"No training jobs have reported valid results yet.\")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "dc846703", "metadata": {}, "outputs": [], "source": [ "df = df[df[\"TrainingJobStatus\"] == \"Completed\"] # filter out the failed jobs\n", "output_path_best_tuning_job = os.path.join(\n", " f\"s3://{DEFAULT_BUCKET}/{prefix}/output\", df[\"TrainingJobName\"].iloc[0], \"output\"\n", ")\n", "\n", "print(f\"The output path of the best model from the hpo tuning is: {output_path_best_tuning_job}\")" ] }, { "cell_type": "markdown", "id": "9cf7b9f4", "metadata": {}, "source": [ "### 4.2. Deploy & run Inference on the fine-tuned model" ] }, { "cell_type": "code", "execution_count": null, "id": "af833417", "metadata": {}, "outputs": [], "source": [ "endpoint_name_hpo = f\"{config.SOLUTION_PREFIX}-ner-hpo-endpoint\"\n", "\n", "tuning_best_model = HuggingFaceModel(\n", " model_data=os.path.join(output_path_best_tuning_job, \"model.tar.gz\"),\n", " source_dir=\"containers/entity_recognition/finetuning\",\n", " entry_point=\"inference.py\",\n", " role=IAM_ROLE,\n", " py_version=\"py38\",\n", " pytorch_version=\"1.10.2\",\n", " transformers_version=\"4.17.0\",\n", ")\n", "\n", "finetuned_predictor_hpo = tuning_best_model.deploy(\n", " instance_type=inference_instance_type,\n", " endpoint_name=endpoint_name_hpo,\n", " initial_instance_count=1,\n", ")\n", "\n", "time.sleep(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "0e6cb3ce", "metadata": {}, "outputs": [], "source": [ "content_type = \"application/list-text\"\n", "\n", "batch_size = 10\n", "num_examples = test_data.shape[0]\n", "predicted_label_hpo, token_hpo, word_id_hpo = [], [], []\n", "for i in np.arange(0, num_examples, step=batch_size):\n", " query_response_batch = query_endpoint(\n", " test_data.iloc[i : (i 
+ batch_size), :].tokens.values.tolist(),\n", " endpoint_name_hpo,\n", " )\n", "\n", " predicted_label_batch, token_batch, word_id_batch = parse_response(query_response_batch)\n", " predicted_label_hpo.extend(predicted_label_batch)\n", " token_hpo.extend(token_batch)\n", " word_id_hpo.extend(word_id_batch)" ] }, { "cell_type": "code", "execution_count": null, "id": "f115cf54", "metadata": {}, "outputs": [], "source": [ "token_all_hpo, predict_all_hpo, groundtruth_all_hpo = [], [], []\n", "\n", "for idx, i in enumerate(predicted_label_hpo):\n", " tmp_token, tmp_predict, tmp_gt = [], [], []\n", " for idx2, each_token in enumerate(token_hpo[idx]):\n", " if each_token in [\"[CLS]\", \"[SEP]\"]:\n", " continue\n", " assert len(i) == len(labels_gt[idx]) == len(token_hpo[idx])\n", " tmp_token.append(each_token)\n", " tmp_predict.append(i[idx2])\n", " tmp_gt.append(labels_gt[idx][idx2])\n", " assert len(tmp_token) == len(tmp_predict) == len(tmp_gt)\n", " token_all_hpo.append(tmp_token)\n", " predict_all_hpo.append(tmp_predict)\n", " groundtruth_all_hpo.append(tmp_gt)\n", "\n", "# A bare list is always truthy, so wrap the check in all() to actually test every sequence\n", "assert all(-100 not in x for x in groundtruth_all_hpo)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb2b1ceb", "metadata": {}, "outputs": [], "source": [ "metrics_hpo = metric.compute(predictions=predict_all_hpo, references=groundtruth_all_hpo)\n", "result_hpo = {\n", " \"precision\": [metrics_hpo[\"overall_precision\"]],\n", " \"recall\": [metrics_hpo[\"overall_recall\"]],\n", " \"f1\": [metrics_hpo[\"overall_f1\"]],\n", " \"accuracy\": [metrics_hpo[\"overall_accuracy\"]],\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "5254c911", "metadata": {}, "outputs": [], "source": [ "result_hpo = pd.DataFrame.from_dict(result_hpo, orient=\"index\", columns=[\"With HPO\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "b150980d", "metadata": {}, "outputs": [], "source": [ "pd.concat([result, result_hpo], axis=1)" ] }, { "cell_type": "markdown", "id": "165f8af4", 
"metadata": {}, "source": [ "We can see results with hyperparameter optimization shows better performance on the hold-out test data." ] }, { "cell_type": "markdown", "id": "77616c0d", "metadata": {}, "source": [ "## 4.3. Clean Up the endpoint\n", "\n", "When you've finished with the summarization endpoint (and associated\n", "endpoint-config), make sure that you delete it to avoid accidental\n", "charges." ] }, { "cell_type": "code", "execution_count": null, "id": "154de147", "metadata": {}, "outputs": [], "source": [ "# # Delete the SageMaker endpoint and the attached resources\n", "sagemaker_client = boto3.client(\"sagemaker\")\n", "\n", "finetuned_predictor.delete_model()\n", "sagemaker_client.delete_endpoint(\n", " EndpointName=endpoint_name_finetune\n", ") ## cannot call finetuned_predictor.delete_endpoint() because 'HuggingFaceModel' object has no attribute 'delete_endpoint'\n", "sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_finetune)\n", "\n", "finetuned_predictor_hpo.delete_model()\n", "sagemaker_client.delete_endpoint(EndpointName=endpoint_name_hpo)\n", "sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_hpo)" ] }, { "cell_type": "markdown", "id": "6fdbedb3", "metadata": {}, "source": [ "## Next Stage\n", "\n", "We've just looked at how you can extract named entities and noun chunks\n", "from a document. Up next we look at a technique that can be used to\n", "classify relationships between entities.\n", "\n", "[Click here to continue with Relation Extraction.](./5_relationship_extraction.ipynb)" ] }, { "cell_type": "markdown", "id": "acfed0df", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_entity_recognition.ipynb)\n" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, 
"hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, 
"memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": 
"Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, 
"_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { 
"_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "kernelspec": { "display_name": "Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.13-cpu-py39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }

archived/identify_key_insights_from_textual_document/document_entity_recognition.ipynb (1,896 lines of code) (raw):